The PELCRA Spoken Offline Corpora of conversational Polish are available via the links in the table below.
Each corpus consists of speech recordings (in WAV format) and word-by-word transcriptions, which also include some non-speech events. The transcriptions are complemented with word-level and phone-level annotations (both in EAF format) and, where available, with video content (MP4 format) and PDF transcripts.
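EAF is the XML-based annotation format used by ELAN, so the word and phone annotations can be read with a standard XML parser. The following is a minimal sketch in Python, not part of the corpus distribution: the tier name `words` and the embedded sample document are hypothetical, and the actual tier names in the PELCRA files should be checked before relying on them. It assumes time-aligned annotations whose time slots carry values (typically in milliseconds).

```python
import io
import xml.etree.ElementTree as ET


def read_eaf_tiers(source):
    """Return {tier_id: [(start, end, text), ...]} for time-aligned annotations.

    `source` is a file path or a file-like object containing an EAF document.
    """
    root = ET.parse(source).getroot()

    # TIME_ORDER maps time-slot ids to offsets; slots without a value are skipped.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }

    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            rows.append((start, end, text))
        tiers[tier.get("TIER_ID")] = rows
    return tiers


# Hypothetical, hand-written sample; real corpus files will contain more tiers.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="420"/>
  </TIME_ORDER>
  <TIER TIER_ID="words">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>tak</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

if __name__ == "__main__":
    for tier_id, rows in read_eaf_tiers(io.StringIO(SAMPLE_EAF)).items():
        for start, end, text in rows:
            print(tier_id, start, end, text)
```

To read a downloaded file instead of the sample, pass its path to `read_eaf_tiers`.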
Metadata are provided in XML files listing information about the recordings (titles, topics, dates, and URLs), the media available (audio, video, PDF), and annotation details (file, date, annotator, place, duration, with additional information about the speakers whenever such data were available). A Document Type Definition (DTD) specifying the elements and attributes of the XML documents is included in each corpus. An SQLite database with the metadata for all the corpora is also available for download.
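The schema of the SQLite database is not described here, so a reasonable first step after downloading it is to list its tables and columns. The sketch below uses only Python's standard `sqlite3` module. The function name `describe_schema` is illustrative, and nothing is assumed about the actual table names.

```python
import sqlite3


def describe_schema(conn):
    """Return {table_name: [column_name, ...]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for name in tables:
        # PRAGMA statements do not take bound parameters, so quote the identifier.
        quoted = '"' + name.replace('"', '""') + '"'
        # Column 1 of each table_info row is the column name.
        schema[name] = [col[1] for col in conn.execute(f"PRAGMA table_info({quoted})")]
    return schema
```

For example, `describe_schema(sqlite3.connect("pelcra_metadata.sqlite"))` would print-ready the layout of the downloaded database, where `pelcra_metadata.sqlite` is a placeholder for whatever the downloaded file is actually called.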
Together, the PELCRA spoken offline corpora contain about 839,000 words.
|PELCRA_EMO||A corpus of focused interviews (people reflecting upon their emotions).||40||80||252,000||Download|
|PELCRA_LUZ||A corpus of open interviews.||21||42||213,000||Download|
|PELCRA_EMI||A corpus of Polish emigrants to Scotland.||22||44||96,000||Download|
|PELCRA_PARL||Samples of spoken parliamentary data.||48||241||99,000||Download|
|PELCRA_YT||Samples of Polish YouTubers' videos.||25||106||49,000||Download|
|MOWA_MIAST||A corpus of Polish conversations recorded in the 1980s.||28||103||130,000||Download|
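The overall figure of about 839,000 words can be checked against the table. The short Python sketch below sums the values from the fifth column, on the assumption (the table carries no header row) that this column gives the word count of each corpus.

```python
# Word counts copied from the table above; the column meaning is an assumption.
WORDS = {
    "PELCRA_EMO": 252_000,
    "PELCRA_LUZ": 213_000,
    "PELCRA_EMI": 96_000,
    "PELCRA_PARL": 99_000,
    "PELCRA_YT": 49_000,
    "MOWA_MIAST": 130_000,
}

total = sum(WORDS.values())
print(f"{total:,}")  # 839,000
```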
The following paper should be cited to fulfill the CC attribution condition of the license for these resources:
This website was last updated on 7 January 2020.